Abstract. For data assumed to come from a finite mixture with an unknown
number of components, it has become common to use Dirichlet process mixtures
(DPMs) not only for density estimation, but also for inferences about the
number of components. The typical approach is to use the posterior distribution
on the number of components occurring so far, that is, the posterior on the
number of clusters in the observed data. However, it turns out that this
posterior is not consistent: it does not converge to the true number of
components. In this note, we give an elementary demonstration of this
inconsistency in what is perhaps the simplest possible setting: a DPM with
normal components of unit variance, applied to data from a "mixture" with one
standard normal component. Further, we find that this example exhibits severe
inconsistency: instead of going to 1, the posterior probability that there is
one cluster goes to 0.



1. Introduction 

It is well known that Dirichlet process mixtures (DPMs) of normals are consistent
for the density: given data from a sufficiently regular density $p_0$, the
posterior converges to the point mass at $p_0$ (see [17, 4] for details and references).
However, it is easy to see that this does not imply consistency for the number 
of components, since for example, a good estimate of the density might include 
superfluous components having vanishingly small weight. 

Despite the fact that a DPM has infinitely many components with probability 1, 
it has become common to apply DPMs to data assumed to come from a finite 
mixture, and to apply the posterior on the number of components used to generate 
the observed data (in other words, the posterior on the number of clusters in the 
data) for inferences about the true number of components (see [1, 16, 14, 10, 8, 18, 9] 
for a few prominent examples). Thus, it is important to understand the properties 
of this procedure. 

In this note, we give a simple example in which a DPM is applied to data from 
a finite mixture and the posterior distribution on the number of clusters does not 
converge to the true number of components. In fact, DPMs exhibit this type of 
inconsistency under very general conditions, as we will show elsewhere — however, 
the aim of this note is brevity and clarity. To this end, we focus our attention on 
a special case that is as simple as possible: a "standard normal DPM", that is, a
DPM using univariate normal components of unit variance, with a standard normal 
base measure (prior on component means). 

Some authors have empirically observed that the DPM posterior tends to
overestimate the number of components (e.g. [16, 9, 13], among others), and have
found that ignoring tiny clusters tends to mitigate this issue. It might be
possible to obtain consistent estimators in this way. However, by adopting such
a procedure, one is abandoning the DPM model, and it is not clear what model
(if any) would give rise to such a procedure.

A more natural way to obtain consistency is the following: if the number of
components is unknown, put a prior on the number of components. For example,
draw the number of components $s$ from a probability mass function $p(s)$ on
$\{1, 2, \dots\}$ with $p(s) > 0$ for all $s$, draw mixing proportions
$\pi = (\pi_1, \dots, \pi_s)$ from an $s$-dimensional Dirichlet (given $s$),
draw component parameters $\theta_1, \dots, \theta_s$ i.i.d. (given $s$ and
$\pi$) from an appropriate prior, and draw $X_1, X_2, \dots$ i.i.d. (given $s$,
$\pi$, and $\theta_{1:s}$) from the resulting mixture. This approach has been
widely used [11, 15, 6, 12]. Strictly speaking, as defined, such a model is not
identifiable, but it is fairly straightforward to modify it to be identifiable
by choosing one representative from each equivalence class. Subject to a
modification of this sort, it can be shown (see e.g. [11]) that under very
general conditions such models are (a.e.) consistent for the number of
components, the mixing proportions, the component parameters, and the density
(for data from a finite mixture of the chosen family). It is a common
misperception that efficient (approximate) inference is more difficult in these
models than in DPMs; to the contrary, we have found that an appropriately
constructed MCMC sampler for such a model is nearly identical to an MCMC
sampler for a DPM. Further details will be provided elsewhere, since they are
beyond the scope of this note.
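
To make the generative recipe above concrete, here is a minimal sketch in Python. The specific choices are assumptions made only for illustration and are not taken from the note: a geometric prior for $p(s)$, a symmetric Dirichlet for the mixing proportions, $\mathcal{N}(0, 1)$ component means with unit-variance normal components, and the hypothetical function name `sample_finite_mixture`.

```python
import random

def sample_finite_mixture(n, rng, p_geom=0.5, dirichlet_conc=1.0):
    """Draw n points from a finite normal mixture with a random number of components."""
    # Number of components s ~ Geometric(p_geom) on {1, 2, ...}, so p(s) > 0 for all s.
    # (Assumed prior; any pmf with full support on the positive integers would do.)
    s = 1
    while rng.random() > p_geom:
        s += 1
    # Mixing proportions pi ~ symmetric s-dimensional Dirichlet, via normalized gammas.
    g = [rng.gammavariate(dirichlet_conc, 1.0) for _ in range(s)]
    total = sum(g)
    pi = [v / total for v in g]
    # Component means theta_1, ..., theta_s i.i.d. N(0, 1) (assumed prior).
    theta = [rng.gauss(0.0, 1.0) for _ in range(s)]
    # Data: choose a component with probability pi, then draw from N(theta_i, 1).
    xs = []
    for _ in range(n):
        i = rng.choices(range(s), weights=pi)[0]
        xs.append(rng.gauss(theta[i], 1.0))
    return s, pi, theta, xs

rng = random.Random(0)
s, pi, theta, xs = sample_finite_mixture(20, rng)
print(s, [round(p, 3) for p in pi])
```

Inference in such a model targets the posterior on $s$ itself, in contrast to the DPM, where the object of inference is the number of clusters in the data.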

The rest of the paper is organized as follows. In Section 2, we define the DPM
model under consideration. In Section 3, we give an elementary proof of
inconsistency for a standard normal DPM. In Section 4, we show (using
Hoeffding's strong law of large numbers for U-statistics) that this example is
in fact severely inconsistent, in the sense that the posterior probability of
the true number of components goes to 0.



2. Setup 

In this section, we define the Dirichlet process mixture model. 

2.1. Dirichlet process mixture model. The DPM model was introduced by 
Ferguson [3] for the purpose of Bayesian density estimation, and was made practical 
through the efforts of several authors (see [2] and references therein). We will use 
$p(\cdot)$ to denote probabilities under the DPM model (as opposed to other probability
distributions that will be considered in what follows). The core of the DPM is
the so-called Chinese restaurant process (CRP), which defines a certain probability
distribution on partitions. Given $n \in \{1, 2, \dots\}$ and $t \in \{1, \dots, n\}$,
let $\mathcal{A}_t(n)$ denote the set of all ordered partitions $(A_1, \dots, A_t)$ of
$\{1, \dots, n\}$ into $t$ nonempty sets. In other words,

\[
\mathcal{A}_t(n) = \Big\{ (A_1, \dots, A_t) : A_1, \dots, A_t \text{ are disjoint},\ \bigcup_{i=1}^{t} A_i = \{1, \dots, n\},\ |A_i| \ge 1\ \forall i \Big\}.
\]

The CRP with concentration parameter $\alpha > 0$ defines a probability mass function
on $\mathcal{A}(n) = \bigcup_{t=1}^{n} \mathcal{A}_t(n)$ by setting

\[
p(A) = \frac{\alpha^t}{\alpha^{(n)}\, t!} \prod_{i=1}^{t} (|A_i| - 1)!
\]

for $A \in \mathcal{A}_t(n)$, where $\alpha^{(n)} = \alpha(\alpha + 1) \cdots (\alpha + n - 1)$.
Note that since $t$ is a function of $A$, we have $p(A) = p(A, t)$. (It is more
common to see this distribution defined in terms of unordered partitions
$\{A_1, \dots, A_t\}$, in which case the $t!$ does not appear in the denominator;
however, for our purposes it is more convenient to use the distribution on
ordered partitions $(A_1, \dots, A_t)$ obtained by uniformly permuting the parts.
This does not affect the prior or posterior on $t$.)
Consider the hierarchical model 

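\begin{align*}
p(A, t) = p(A) &= \frac{\alpha^t}{\alpha^{(n)}\, t!} \prod_{i=1}^{t} (|A_i| - 1)!, \tag{2.1} \\
p(\theta_{1:t} \mid A, t) &= \prod_{i=1}^{t} p(\theta_i), \text{ and} \\
p(x_{1:n} \mid \theta_{1:t}, A, t) &= \prod_{i=1}^{t} \prod_{j \in A_i} p_{\theta_i}(x_j),
\end{align*}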

where $p(\theta)$ is a prior on component parameters $\theta \in \Theta$, and
$\{p_\theta : \theta \in \Theta\}$ is a parametrized family of distributions on
$x \in \mathcal{X}$ for the components. Typically, $\mathcal{X} \subset \mathbb{R}^d$
and $\Theta \subset \mathbb{R}^k$ for some $d$ and $k$. Here,
$x_{1:n} = (x_1, \dots, x_n)$ with $x_i \in \mathcal{X}$, and
$\theta_{1:t} = (\theta_1, \dots, \theta_t)$ with $\theta_i \in \Theta$. The
marginal distribution on $x_{1:n}$ is called a Dirichlet process mixture (DPM) model.

The prior on the number of clusters $t$ under this model is
$p_n(t) = \sum_{A \in \mathcal{A}_t(n)} p(A, t)$. We use $T_n$ (rather than $T$) to
denote the random variable representing the number of clusters, as a reminder
that its distribution depends on $n$. Note the distinction between the terms
"component" and "cluster": a component is part of a mixture distribution, while
a cluster is the set of (indices of) data points coming from a given component.
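
As a sanity check on the definition of $p_n(t)$, the following sketch computes the prior on the number of clusters by brute-force enumeration for small $n$. It enumerates unordered partitions and uses the fact that each one corresponds to $t!$ ordered partitions, so the $t!$ in the CRP formula cancels. The helper names (`set_partitions`, `rising_factorial`, `prior_num_clusters`) are hypothetical and chosen here for illustration only.

```python
import math
from fractions import Fraction

def set_partitions(items):
    """Yield all unordered partitions of a list into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def rising_factorial(alpha, n):
    """alpha^(n) = alpha (alpha + 1) ... (alpha + n - 1)."""
    out = Fraction(1)
    for i in range(n):
        out *= alpha + i
    return out

def prior_num_clusters(n, alpha):
    """Return [p_n(1), ..., p_n(n)] under the CRP with concentration alpha."""
    alpha = Fraction(alpha)
    p = [Fraction(0)] * (n + 1)
    for part in set_partitions(list(range(n))):
        t = len(part)
        # Sum over the t! orderings of this partition: the t! cancels, leaving
        # alpha^t / alpha^(n) * prod_i (|A_i| - 1)!.
        w = alpha ** t / rising_factorial(alpha, n)
        for block in part:
            w *= math.factorial(len(block) - 1)
        p[t] += w
    return p[1:]

probs = prior_num_clusters(6, 1)
print([str(q) for q in probs])
```

With $\alpha = 1$ the first entry is $p_n(1) = 1/n$, which is the value of $p(A)$ for the single-cluster partition.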

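Since we are concerned with the posterior distribution $p(T_n = t \mid x_{1:n})$ on the
number of clusters, we will be especially interested in the marginal distribution on
$(x_{1:n}, t)$, given by

\begin{align*}
p(x_{1:n}, T_n = t) &= \sum_{A \in \mathcal{A}_t(n)} p(A) \prod_{i=1}^{t} \int \Big( \prod_{j \in A_i} p_\theta(x_j) \Big) p(\theta)\, d\theta \tag{2.2} \\
&= \sum_{A \in \mathcal{A}_t(n)} p(A) \prod_{i=1}^{t} m(x_{A_i}),
\end{align*}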
where for any subset of indices $S \subset \{1, \dots, n\}$, we denote
$x_S = (x_j : j \in S)$ and let $m(x_S)$ denote the single-cluster marginal of $x_S$,

\[
m(x_S) = \int \Big( \prod_{j \in S} p_\theta(x_j) \Big) p(\theta)\, d\theta. \tag{2.3}
\]

2.2. Specialization to the standard normal case. In this note, for brevity and 
clarity, we focus on the univariate normal case with unit variance, with a standard 
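normal prior on means, that is, for $x \in \mathbb{R}$ and $\theta \in \mathbb{R}$,

\begin{align*}
p_\theta(x) &= \mathcal{N}(x \mid \theta, 1) = \frac{1}{\sqrt{2\pi}} \exp\big(-\tfrac{1}{2}(x - \theta)^2\big), \text{ and} \\
p(\theta) &= \mathcal{N}(\theta \mid 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\big(-\tfrac{1}{2}\theta^2\big).
\end{align*}

It is a straightforward calculation to show that the single-cluster marginal is then

\[
m(x_{1:n}) = \frac{1}{\sqrt{n+1}}\, p_0(x_{1:n}) \exp\Big( \frac{1}{2} \frac{1}{n+1} \Big( \sum_{j=1}^{n} x_j \Big)^2 \Big), \tag{2.4}
\]

where $p_0(x_{1:n}) = p_0(x_1) \cdots p_0(x_n)$ (and $p_0$ is the $\mathcal{N}(0, 1)$
density). When $p_\theta(x)$ and $p(\theta)$ are as above, we refer to the
resulting DPM as a standard normal DPM.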
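
The closed form for the single-cluster marginal can be checked numerically. The sketch below compares it against a direct trapezoid-rule evaluation of the defining integral over $\theta$, for an arbitrary small data vector. The integration range, step count, and function names are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def m_closed_form(xs):
    """Single-cluster marginal via the closed form (Equation 2.4)."""
    n = len(xs)
    p0 = math.prod(normal_pdf(x, 0.0) for x in xs)
    return p0 * math.exp(0.5 * sum(xs) ** 2 / (n + 1)) / math.sqrt(n + 1)

def m_quadrature(xs, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid-rule approximation of the integral over theta (Equation 2.3)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        f = normal_pdf(theta, 0.0) * math.prod(normal_pdf(x, theta) for x in xs)
        # Endpoints get half weight in the trapezoid rule.
        total += f if 0 < i < steps else 0.5 * f
    return total * h

xs = [0.3, -1.2, 0.8, 2.1]
print(m_closed_form(xs), m_quadrature(xs))
```

The two values should agree to many digits, since the integrand is smooth and decays like a Gaussian in $\theta$.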



3. Elementary example of inconsistency 

In this section, we prove the following result, exhibiting a simple example in 
which a DPM is inconsistent for the number of components: the true number of 
components is 1, but the posterior probability of $T_n = 1$ does not converge to 1.
To keep it simple, we set $\alpha = 1$, but the proof extends trivially to any $\alpha > 0$.

Proposition 3.1. If $X_1, X_2, \dots \sim \mathcal{N}(0, 1)$ i.i.d. then with probability 1, under the
standard normal DPM with $\alpha = 1$ as defined above, $p(T_n = 1 \mid X_{1:n})$ does not
converge to 1 as $n \to \infty$.



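Proof. Let $n \in \{2, 3, \dots\}$. Let $x_1, \dots, x_n \in \mathbb{R}$, $A \in \mathcal{A}_2(n)$,
and $a_i = |A_i|$ for $i = 1, 2$. Define $s_n = \sum_{j=1}^{n} x_j$ and
$s_{A_i} = \sum_{j \in A_i} x_j$ for $i = 1, 2$. Using Equation 2.4
and noting that $1/(n+1) \le 1/(n+2) + 1/n^2$, we have

\[
\sqrt{n+1}\, \frac{m(x_{1:n})}{p_0(x_{1:n})} = \exp\Big( \frac{1}{2} \frac{s_n^2}{n+1} \Big) \le \exp\Big( \frac{1}{2} \frac{s_n^2}{n+2} \Big) \exp\Big( \frac{1}{2} \frac{s_n^2}{n^2} \Big).
\]

The second factor equals $\exp(\tfrac{1}{2} \bar{x}_n^2)$, where
$\bar{x}_n = \frac{1}{n} \sum_{j=1}^{n} x_j$. Writing $s_n/(n+2)$ as a convex
combination of $s_{A_1}/(a_1+1)$ and $s_{A_2}/(a_2+1)$ (with weights
$(a_1+1)/(n+2)$ and $(a_2+1)/(n+2)$), by the convexity of $x \mapsto x^2$
the first factor is less than or equal to

\[
\exp\Big( \frac{1}{2} \frac{s_{A_1}^2}{a_1+1} + \frac{1}{2} \frac{s_{A_2}^2}{a_2+1} \Big) = \sqrt{a_1+1}\, \sqrt{a_2+1}\; \frac{m(x_{A_1})\, m(x_{A_2})}{p_0(x_{1:n})}.
\]

Hence,

\[
\frac{m(x_{1:n})}{m(x_{A_1})\, m(x_{A_2})} \le \frac{\sqrt{a_1+1}\, \sqrt{a_2+1}}{\sqrt{n+1}} \exp(\tfrac{1}{2} \bar{x}_n^2). \tag{3.1}
\]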
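
Consequently, we have

\begin{align*}
\frac{p(x_{1:n}, T_n = 2)}{p(x_{1:n}, T_n = 1)}
&\overset{(a)}{=} \sum_{A \in \mathcal{A}_2(n)} n\, p(A)\, \frac{m(x_{A_1})\, m(x_{A_2})}{m(x_{1:n})} \\
&\overset{(b)}{\ge} \sum_{A \in \mathcal{A}_2(n)} n\, p(A)\, \frac{\sqrt{n+1}}{\sqrt{a_1+1}\, \sqrt{a_2+1}} \exp(-\tfrac{1}{2} \bar{x}_n^2) \\
&\overset{(c)}{\ge} \sum_{\substack{A \in \mathcal{A}_2(n): \\ |A_1| = 1}} n\, \frac{(n-2)!}{n!\, 2!}\, \frac{\sqrt{n+1}}{\sqrt{2}\, \sqrt{n}} \exp(-\tfrac{1}{2} \bar{x}_n^2) \\
&\overset{(d)}{\ge} \frac{1}{2\sqrt{2}} \exp(-\tfrac{1}{2} \bar{x}_n^2),
\end{align*}

where step (a) follows from applying Equation 2.2 to both numerator and
denominator, plus using Equation 2.1 (with $\alpha = 1$) to see that
$p(x_{1:n}, T_n = 1) = m(x_{1:n})/n$, step (b) follows from Equation 3.1 above,
step (c) follows since all the terms in the sum are nonnegative and
$p(A) = (n-2)!/(n!\, 2!)$ when $|A_1| = 1$ (by Equation 2.1, with $\alpha = 1$),
and step (d) follows since there are $n$ partitions $A \in \mathcal{A}_2(n)$ such
that $|A_1| = 1$.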

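If $X_1, X_2, \dots \sim \mathcal{N}(0, 1)$ i.i.d. then by the law of large numbers,
$\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j \to 0$ almost surely as $n \to \infty$. Therefore,

\begin{align*}
p(T_n = 1 \mid X_{1:n}) &= \frac{p(X_{1:n}, T_n = 1)}{\sum_{t=1}^{\infty} p(X_{1:n}, T_n = t)} \le \frac{p(X_{1:n}, T_n = 1)}{p(X_{1:n}, T_n = 1) + p(X_{1:n}, T_n = 2)} \\
&\le \frac{1}{1 + \frac{1}{2\sqrt{2}} \exp(-\tfrac{1}{2} \bar{X}_n^2)} \xrightarrow[n \to \infty]{\text{a.s.}} \frac{1}{1 + \frac{1}{2\sqrt{2}}} < 1.
\end{align*}

Hence, almost surely, $p(T_n = 1 \mid X_{1:n})$ does not converge to 1. □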

Note that the only property of the $\mathcal{N}(0, 1)$ data distribution that we used was
$\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j \xrightarrow{\text{a.s.}} \mathbb{E} X_1 \in \mathbb{R}$.
Thus, we could clearly have let $X_1, X_2, \dots$ be
i.i.d. from any distribution with finite mean, and still $p(T_n = 1 \mid X_{1:n})$ would not
converge to 1.
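
The bound in the proof holds for every fixed data vector with $n \ge 2$, so it can be checked exactly for small $n$ by enumerating partitions. The sketch below computes the posterior on the number of clusters for a standard normal DPM with $\alpha = 1$ and compares $p(T_n = 1 \mid x_{1:n})$ with the upper bound $1/(1 + \frac{1}{2\sqrt{2}} \exp(-\tfrac{1}{2}\bar{x}_n^2))$. The data values and helper names are arbitrary choices for illustration, and the enumeration is only feasible for small $n$.

```python
import math

def set_partitions(items):
    """Yield all unordered partitions of a list into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def log_h(block):
    """log of m(x_S) / p_0(x_S) for a block of data values (from Equation 2.4)."""
    k = len(block)
    return 0.5 * sum(block) ** 2 / (k + 1) - 0.5 * math.log(k + 1)

def posterior_num_clusters(xs):
    """Exact p(T_n = t | x_{1:n}) for t = 1..n with alpha = 1, by enumeration."""
    n = len(xs)
    weights = [0.0] * (n + 1)
    for part in set_partitions(list(xs)):
        # Weight of an unordered partition: prod_i (|A_i| - 1)! * m(x_{A_i}) / p_0(x_{A_i}).
        # The common factors 1/n! and p_0(x_{1:n}) cancel when normalizing.
        log_w = sum(math.lgamma(len(b)) + log_h(b) for b in part)
        weights[len(part)] += math.exp(log_w)
    z = sum(weights)
    return [w / z for w in weights[1:]]

xs = [0.1, -0.4, 0.9, -1.3, 0.5, 0.2, -0.7, 1.1]
post = posterior_num_clusters(xs)
xbar = sum(xs) / len(xs)
bound = 1.0 / (1.0 + math.exp(-0.5 * xbar ** 2) / (2.0 * math.sqrt(2.0)))
print(post[0], bound)
```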

4. Severe inconsistency 

In the previous section, we showed that $p(T_n = 1 \mid X_{1:n})$ does not converge to 1
for a standard normal DPM on standard normal data. In this section, we prove
that in fact, it converges to 0 in probability. This vividly illustrates that improper use of DPMs
can lead to entirely misleading results. The key step in the proof is an application
of Hoeffding's strong law of large numbers for U-statistics. The proof generalizes
easily to any $\alpha > 0$.

Theorem 4.1. If $X_1, X_2, \dots \sim \mathcal{N}(0, 1)$ i.i.d. then

\[
p(T_n = 1 \mid X_{1:n}) \xrightarrow{\text{Pr}} 0 \quad \text{as } n \to \infty
\]

under the standard normal DPM with concentration parameter $\alpha = 1$.
Proof. For $t = 1$ and $t = 2$ define

\[
R_t(X_{1:n}) = n^{3/2}\, \frac{p(X_{1:n}, T_n = t)}{p_0(X_{1:n})}.
\]

(For general $\alpha > 0$, replace $n^{3/2}$ above by $n^{3/2} \alpha^{(n)}/n!$.) Our method of proof is as
follows. We will show that

\[
R_2(X_{1:n}) \xrightarrow[n \to \infty]{\text{Pr}} \infty
\]

(or in other words, for any $B > 0$ we have $\mathbb{P}(R_2(X_{1:n}) > B) \to 1$ as $n \to \infty$), and
we will show that $R_1(X_{1:n})$ is bounded in probability:

\[
R_1(X_{1:n}) = O_P(1)
\]

(or in other words, for any $\varepsilon > 0$ there exists $B_\varepsilon > 0$ such that
$\mathbb{P}(R_1(X_{1:n}) > B_\varepsilon) \le \varepsilon$ for all $n \in \{1, 2, \dots\}$).
Putting these two together, we will have

\[
p(T_n = 1 \mid X_{1:n}) = \frac{p(X_{1:n}, T_n = 1)}{\sum_{t=1}^{\infty} p(X_{1:n}, T_n = t)} \le \frac{p(X_{1:n}, T_n = 1)}{p(X_{1:n}, T_n = 2)} = \frac{R_1(X_{1:n})}{R_2(X_{1:n})} \xrightarrow[n \to \infty]{\text{Pr}} 0.
\]

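First, let's show that $R_2(X_{1:n}) \to \infty$ in probability. For $S \subset \{1, \dots, n\}$ with
$|S| \ge 1$, define $h(x_S)$ by

\[
h(x_S) = \frac{m(x_S)}{p_0(x_S)} = \frac{1}{\sqrt{|S|+1}} \exp\Big( \frac{1}{2} \frac{1}{|S|+1} \Big( \sum_{j \in S} x_j \Big)^2 \Big),
\]

where $m$ is the single-cluster marginal as in Equations 2.3 and 2.4. Note that when
$1 \le |S| \le n - 1$, we have $\sqrt{n}\, h(x_S) \ge 1$, since the exponential factor
is at least 1 and $|S| + 1 \le n$. Note also that $\mathbb{E} h(X_S) = 1$ since

\[
\mathbb{E} h(X_S) = \int h(x_S)\, p_0(x_S)\, dx_S = \int m(x_S)\, dx_S = 1,
\]

using the fact that $m(x_S)$ is a density with respect to Lebesgue measure. For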
$k \in \{1, \dots, n\}$, define the U-statistics

\[
U_k(X_{1:n}) = \frac{1}{\binom{n}{k}} \sum_{|S| = k} h(X_S)
\]

where the sum is over all $S \subset \{1, \dots, n\}$ such that $|S| = k$. By Hoeffding's strong
law of large numbers for U-statistics [7],

\[
U_k(X_{1:n}) \xrightarrow[n \to \infty]{\text{a.s.}} \mathbb{E} h(X_{1:k}) = 1
\]


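for any $k \in \{1, 2, \dots\}$. Therefore, using Equations 2.1 and 2.2 we have that for any
$K \in \{1, 2, \dots\}$ and any $n > K$,

\begin{align*}
R_2(X_{1:n}) &= n^{3/2} \sum_{A \in \mathcal{A}_2(n)} p(A)\, \frac{m(X_{A_1})\, m(X_{A_2})}{p_0(X_{1:n})} \\
&\ge n \sum_{A \in \mathcal{A}_2(n)} p(A)\, h(X_{A_1}) \\
&= n \sum_{k=1}^{n-1} \sum_{|S| = k} \frac{(k-1)!\,(n-k-1)!}{n!\, 2!}\, h(X_S) \\
&= \sum_{k=1}^{n-1} \frac{n}{2k(n-k)}\, \frac{1}{\binom{n}{k}} \sum_{|S| = k} h(X_S) \\
&= \sum_{k=1}^{n-1} \frac{n}{2k(n-k)}\, U_k(X_{1:n}) \\
&\ge \sum_{k=1}^{K} \frac{n}{2k(n-k)}\, U_k(X_{1:n}) \\
&\xrightarrow[n \to \infty]{\text{a.s.}} \sum_{k=1}^{K} \frac{1}{2k} = \frac{H_K}{2} > \frac{\log K}{2},
\end{align*}

where $H_K$ is the $K$th harmonic number, and the last inequality follows from the
standard bounds [5] on harmonic numbers: $\log K < H_K \le \log K + 1$. Hence, for
any $K$,

\[
\liminf_{n \to \infty} R_2(X_{1:n}) > \frac{\log K}{2} \quad \text{almost surely,}
\]

and it follows easily that

\[
R_2(X_{1:n}) \xrightarrow[n \to \infty]{\text{a.s.}} \infty.
\]

Convergence in probability is implied by almost sure convergence.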
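
Two purely arithmetic steps in the chain above are easy to verify by machine: the identity $n \cdot \frac{(k-1)!\,(n-k-1)!}{n!\,2!} = \frac{n}{2k(n-k)} \cdot \binom{n}{k}^{-1}$ used to pass to the U-statistics, and the harmonic-number bound $H_K/2 > (\log K)/2$. The following sketch checks both with exact rational arithmetic. The values of $n$ and $K$ and the function names are arbitrary illustrative choices.

```python
import math
from fractions import Fraction

def crp_weight_two_clusters(n, k):
    """n * p(A) for an ordered two-part partition with |A_1| = k (alpha = 1)."""
    return Fraction(n * math.factorial(k - 1) * math.factorial(n - k - 1),
                    math.factorial(n) * 2)

def u_stat_coefficient(n, k):
    """n / (2 k (n - k)) divided by binom(n, k)."""
    return Fraction(n, 2 * k * (n - k)) / math.comb(n, k)

n = 12
identity_holds = all(
    crp_weight_two_clusters(n, k) == u_stat_coefficient(n, k) for k in range(1, n)
)

K = 50
half_harmonic = sum(Fraction(1, 2 * k) for k in range(1, K + 1))
print(identity_holds, float(half_harmonic), math.log(K) / 2)
```

Because the lower bound grows only like $(\log K)/2$, the divergence of $R_2(X_{1:n})$ guaranteed by this argument is slow.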

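Now, let's show that $R_1(X_{1:n}) = O_P(1)$. By Equation 2.1, $p(A) = 1/n$ when
$A = (\{1, \dots, n\})$. Using this along with Equations 2.2 and 2.4, we have

\[
R_1(X_{1:n}) = n^{3/2}\, \frac{p(X_{1:n}, T_n = 1)}{p_0(X_{1:n})} = \sqrt{n}\, \frac{m(X_{1:n})}{p_0(X_{1:n})} = \frac{\sqrt{n}}{\sqrt{n+1}} \exp\Big( \frac{1}{2} \frac{1}{n+1} \Big( \sum_{j=1}^{n} X_j \Big)^2 \Big) \le \exp(Z_n^2/2),
\]

where $Z_n = (1/\sqrt{n}) \sum_{j=1}^{n} X_j \sim \mathcal{N}(0, 1)$ for each $n \in \{1, 2, \dots\}$.
Since $Z_n = O_P(1)$, we conclude that $R_1(X_{1:n}) = O_P(1)$. This completes the proof. □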